Abstract: We investigate a route exploration problem with
$N$ agents dropped randomly on the interval $[0, b]$ and discuss
the impact of using multiple agents to perform this task. We
consider both a discrete and a continuous description of the path
to explore. Independently, we study an exploration problem
with probabilistic agents having limited autonomy. In both
problems, multi-agent scenarios are discussed with an emphasis
on the number of agents necessary to obtain good performance.

I.  Introduction 

Consider a line segment of length $b$, with coordinate $x$
describing a position on the segment. The endpoints are $x = 0$
and $x = b$. The coordinates can take their values in the
discrete set $[0, b] := \{0, 1, \ldots, b\}$, in which case we obtain a
line graph with $(b + 1)$ equally spaced sites, or they can take
their values in the continuous interval $[0, b]$.

We have $N$ agents with initial positions $x_i$, $i = 1, \ldots, N$.
When these initial positions are realizations of the associated
random variables $X_i$, $i = 1, \ldots, N$, we denote the corresponding
order statistics by $(X_{1:N}, X_{2:N}, \ldots, X_{N:N})$, that is, the variables $X_i$
arranged in increasing order: $X_{1:N} \le X_{2:N} \le \ldots \le X_{N:N}$. We
assume a continuous distribution function for the random
variables $X_i$, and therefore we have $P(X_i = X_j) = 0$ for $i \neq j$
(see for example [1] p. 29).

The agents can move along the continuous line with the
same  speed  v.  When  the  line  is  discrete,  we  also  discretize 
the  time:  in  that  case,  at  each  period  an  agent  can  either 
move  to  a  site  that  is  next  to  its  current  position,  or  remain 
at  its  current  position. 

Finally  we  also  consider  non-compliant  agents,  that  react 
probabilistically  to  given  controls.  More  precisely,  a  non- 
compliant  agent  demonstrates  the  following  behavior: 

•  in the continuous case: each agent moves with speed
$v + \sigma W_t$, with $\sigma$ a constant and $W_t$ one-dimensional white
Gaussian noise with unit power spectral density.

•  in the discrete case: when we tell the agent to move one
step in a given direction, it might indeed move in that
direction with probability $p$, but might go in the opposite
direction with probability $q < p$ and also stay where it
is with probability $1 - p - q$. To a one-step displacement
corresponds a random variable $X$: its mean is analogous
to the “speed” of the agent and therefore we write $v = p - q$.
Its variance is $\sigma^2 = (p + q) - (p - q)^2 = p + q + 2pq - p^2 - q^2$.
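The moments of the one-step displacement can be checked with a few lines of Python. This is only an illustrative sketch: the function name `step_moments` and the sample values of `p` and `q` are assumptions, not part of the original model description.

```python
def step_moments(p, q):
    """Mean and variance of the one-step displacement X, where
    X = +1 with probability p (requested direction),
    X = -1 with probability q (opposite direction),
    X =  0 with probability 1 - p - q (agent stays put)."""
    if not (0 <= q < p and p + q <= 1):
        raise ValueError("need 0 <= q < p and p + q <= 1")
    mean = p - q              # plays the role of the speed v
    second_moment = p + q     # X^2 = 1 whenever the agent moves
    variance = second_moment - mean ** 2
    return mean, variance


# Hypothetical example values.
v, var = step_moments(0.7, 0.2)
```

Expanding `second_moment - mean ** 2` gives the polynomial form $p + q + 2pq - p^2 - q^2$.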

$v$ and $\sigma$ play a similar role in the continuous and the discrete
case, therefore we use the same notation in both cases. There
will not be any confusion since we consider these models
separately.

Fig. 1. Problem description and notation in the case of two agents.

Parts II and III examine how to optimally explore the line
with randomly dropped deterministic agents for a specific
cost function, and how many agents we should use. Independently,
parts IV and V discuss simple exploration strategies
for non-compliant agents as well as trade-offs appearing in
multi-agent exploration scenarios.

II.  Deterministic  Optimal  Exploration  Policy 

In this part we consider the continuous model, where the
agents respond deterministically to the controls. Extension
to the discrete case is straightforward. We assume a cost
proportional to the distance that each agent travels. A possible
motivation includes the risk of losing agents along the
path in a hostile environment, increasing as the agents cover
a longer distance. Another example could be that we want
to minimize the amount of energy used by each agent. In
the optimization problem, we seek to minimize the sum of
the distances covered by all the agents. We have $N$ agents
initially at given distinct positions $0 \le x_1 < x_2 < \ldots < x_N \le b$.
To agent $i = 1, \ldots, N$, we assign a part of the line to
explore, called $S_i$, and let $L_i = \min S_i$ and $R_i = \max S_i$. Fig. 1
describes the notation in the case of two agents. Each agent
explores its assigned region optimally by travelling a distance
$d_i = (R_i - L_i) + \min(x_i - L_i, R_i - x_i)$, that is, it travels to the
nearest endpoint first and then to the opposite endpoint.

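The problem of minimum cost exploration for $N$ agents
becomes designing each set $S_i$ so that when each of them is
explored optimally by the corresponding agent, the sum of
the minimum distances is minimized:

$$
\begin{aligned}
\text{minimize} \quad & \sum_{i=1}^{N} \left[(R_i - L_i) + \min(x_i - L_i, R_i - x_i)\right] \\
\text{subject to} \quad & R_i \ge x_i \ge L_i, \quad i = 1, \ldots, N, \\
& R_i \ge L_{i+1}, \quad i = 1, \ldots, N-1, \qquad\qquad (1) \\
& \min\{L_1, \ldots, L_N\} = 0, \\
& \max\{R_1, \ldots, R_N\} = b.
\end{aligned}
$$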
Our design variables are the $L_i$'s and $R_i$'s; the constraints
make sure that the line is completely covered. The following
lemma formalizes the intuitive result that in general exploration
sets should not overlap.

Fig. 2. Problem reformulation in the case of identical agents. The decision
variables are the $S_i$'s.

Lemma 1: There exists an optimal solution for (1) satisfying:

$$L_1 = 0, \quad R_N = b, \quad R_i = L_{i+1}, \quad i = 1, \ldots, N-1.$$

Proof: Although this lemma is intuitively clear, due to
the fact that we are considering identical agents, it is more
tedious to prove formally. Consider an optimal solution for
(1) (an optimal solution exists because of the interpretation
of the problem, or alternatively because we are minimizing
a concave function over a bounded polyhedron). Let $j$ be an
index such that $L_j = 0$. If $j \neq 1$, we consider a modification to
the solution such that $L_j = R_1$ and $L_1 = 0$. Since the agents
are identical, the fact that the leftmost part of the line is
covered by agent $j$ or agent 1 does not change the solution.
Consequently we have an optimal solution such that $L_1 = 0$.
Similarly we can choose the optimal solution such that $R_N = b$.

Now consider two agents $i$ and $i + 1$, $i \in \{1, \ldots, N-1\}$. If $R_i > L_{i+1}$
we consider a modification of the solution (denoted with
a $'$) such that $R'_i = L'_{i+1} = \frac{R_i + L_{i+1}}{2}$. This modification implies
no change in the other variables of the original solution. The
interval $[L_{i+1}, R_i]$ was previously covered at cost $2(R_i - L_{i+1})$
in the case where the agents were not switching direction at
the endpoints; it is now covered at cost $(R_i - L_{i+1})$. In the
other cases where one or two agents switch direction to travel
to their other endpoint, we verify that the cost is still divided
by two. Therefore the initial solution could not be optimal,
and the lemma is proved.  ■

We do not have in general uniqueness of the optimal solution
(as will become clear in the following, there is an optimal
solution using only one or two agents in any case). However
the lemma is useful in restricting our analysis to some natural
configurations. The problem then reduces to the following
(see Fig. 2): referring to the point $R_i = L_{i+1}$ as the point
$R_i$, $i = 1, \ldots, N-1$, with $R_0 = 0$, $R_N = b$, we want to find the
positions of the points $R_i$ in order to minimize
$\sum_{i=1}^{N} \left[(R_i - R_{i-1}) + \min(x_i - R_{i-1}, R_i - x_i)\right]$, which is rewritten:

$$
\begin{aligned}
\text{minimize} \quad & b + \sum_{i=1}^{N} \min(x_i - R_{i-1}, R_i - x_i) \qquad\qquad (2) \\
\text{subject to} \quad & R_0 = 0, \quad R_N = b, \\
& x_i \le R_i \le x_{i+1}, \quad i = 1, \ldots, N-1.
\end{aligned}
$$


Fig.  3.  Optimal  Exploration  Strategy  to  minimize  the  sum  of  individual 
distances.  The  two  cases  correspond  to  the  leftmost  interval  being  the 
shortest  interval  or  not.  Only  one  agent  switches  direction,  and  only  the 
smallest  initial  interval  between  agents  is  covered  twice. 

Proposition 2: Let $x_0 = 0$, $x_{N+1} = b$, and let
$j = \arg\min_{i=0,\ldots,N}\{(x_{i+1} - x_i)\}$. Then we have:

•  if $j = 0$, $R_i = x_{i+1}$, $i = 1, \ldots, N$ is optimal for (2).

•  if $j > 0$, $R_i = x_i$, $i = 0, \ldots, (j-1)$, $R_i = x_{i+1}$, $i = j, \ldots, N$
is optimal for (2).

The cost of an optimal solution of (2) is

$$Z_d(N) = b + \min_{i=0,\ldots,N}\{(x_{i+1} - x_i)\}. \qquad (3)$$

Fig. 3 illustrates this optimal solution, which is fairly
intuitive. In words, we find the smallest interval
between two consecutive agents. Next we choose one of
these two agents, which will have to explore the intervals on
both sides of its initial position. All other agents will have
only one interval to explore. Again we do not have uniqueness
of the solution; for instance, the choice of the directions of
exploration is arbitrary. It is also clear that we can perform the
task with the same cost using only one agent if the shortest
interval is at one of the extremities, or two agents in the other cases.
However our solution is still interesting because it leads to an
optimal solution in the discrete case as well: using only one
or two agents on the discrete line forces them to travel over
sites that are occupied by the other agents remaining idle.
Whereas in the continuous model this does not contribute
an additional displacement cost because the agents' positions
are represented by points of measure 0, the additional cost
of this policy in the discrete case appears clearly.
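The assignment described in Proposition 2 can be sketched in Python as follows. The helper names `optimal_breakpoints` and `objective` and the sample positions are illustrative assumptions; the code only evaluates the objective of (2) at the breakpoints prescribed by the proposition and does not search over other feasible solutions.

```python
def optimal_breakpoints(xs, b):
    """xs: sorted, distinct initial positions in [0, b].
    Returns (R, cost), where R = [R_0, ..., R_N] are the breakpoints
    prescribed by Proposition 2 and cost = b + (smallest gap)."""
    n = len(xs)
    pts = [0.0] + [float(x) for x in xs] + [float(b)]   # x_0 = 0, x_{N+1} = b
    gaps = [pts[i + 1] - pts[i] for i in range(n + 1)]
    j = min(range(n + 1), key=gaps.__getitem__)         # index of the smallest gap
    # R_i = x_i for i < j and R_i = x_{i+1} for i >= j, with R_0 = 0 in any case.
    R = [pts[i] if i < j else pts[i + 1] for i in range(n + 1)]
    R[0] = 0.0
    return R, b + gaps[j]


def objective(xs, R, b):
    """Objective of problem (2): agent i covers [R_{i-1}, R_i]."""
    return b + sum(min(x - R[i], R[i + 1] - x) for i, x in enumerate(xs))


# Hypothetical example: three agents on a line of length 10.
R, cost = optimal_breakpoints([2.0, 3.0, 7.0], 10.0)
```

In this example the smallest gap lies between the first two agents, so only the first agent reverses direction and the interval between positions 2 and 3 is the only one travelled twice.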

Proof: We prove the following result by induction on
$N$: let $N$ agents with initial positions $x_1 < \ldots < x_N$ in the
interval $[\alpha, \beta]$, $0 \le \alpha < \beta \le b$; then the optimal cost for the
exploration of this interval is $\beta - \alpha + \min_{i=0,\ldots,N}\{(x_{i+1} - x_i)\}$,
with the convention $x_0 = \alpha$ and $x_{N+1} = \beta$.

The result is trivial for one agent. Now suppose the result is
true for all $k \le N - 1$, and we want to prove it for $N$
agents, $N \ge 2$. Since we have $N + 1$ intervals to explore
between the starting points, and only $N$ agents, at least one
agent $i$ has to switch direction and travel to both $R_{i-1}$ and $R_i$.
Consider the agents $1, \ldots, N$ in increasing order, and call
$p$ the first agent to switch direction. Agents $1, \ldots, p$ are
exploring $[\alpha, R_p]$, agents $p+1, \ldots, N$ are exploring $[R_p, \beta]$,
and the second group should explore its part optimally. Thus
the induction hypothesis applies for the second group and
the total exploration cost is therefore

$$R_p - \alpha + \min(x_p - x_{p-1}, R_p - x_p) + \beta - R_p + \min(x_{p+1} - R_p, x_{p+2} - x_{p+1}, \ldots, \beta - x_N).$$

Now if $x_p - x_{p-1} \le R_p - x_p$, we obtain a cost of

$$
\begin{aligned}
\beta - \alpha + (x_p - x_{p-1}) + \min(x_{p+1} - R_p, \ldots, \beta - x_N)
& \ge \beta - \alpha + (x_p - x_{p-1}) \\
& \ge \beta - \alpha + \min_{i=0,\ldots,N}\{(x_{i+1} - x_i)\}.
\end{aligned}
$$

If $x_p - x_{p-1} > R_p - x_p$, we obtain a cost of $\beta - \alpha +
R_p - x_p + \min(x_{p+1} - R_p, \ldots, \beta - x_N)$. Considering two cases
for the last term, where the minimum achieved is either
$(x_{p+1} - R_p)$ or not, we see readily that in any case we
obtain a cost lower bounded by an expression of the form
$\beta - \alpha + x_{i+1} - x_i$ for some $i$, and therefore a lower bound on
the cost is again $\beta - \alpha + \min_{i=0,\ldots,N}\{(x_{i+1} - x_i)\}$.

Now it is also easy to see that the solution given in the
proposition achieves this lower bound, which completes the
induction step for $N$ agents.  ■

III.  Agents  Dropped  Randomly  on  the  Line 

With  the  cost  function  considered  in  the  previous  part, 
clearly  there  is  no  benefit  in  using  multiple  agents  if  we 
have  precise  control  on  the  initial  position  of  these  agents. 
We  obtain  the  best  possible  solution  by  simply  placing  one 
agent  at  0  and  letting  it  travel  to  the  other  end  of  the  route. 
However,  we  are  interested  in  the  case  where  agents  are 
dropped  randomly  on  the  path.  More  precisely,  we  assume 
in this part that the initial positions are realizations of $N$ iid
random variables $X_1, \ldots, X_N$ having uniform distribution on
the interval $[0, b]$. Using more agents becomes beneficial in
expectation,  as  we  can  reduce  the  minimum  initial  interval 
between  them,  which  is  the  only  variable  part  in  the  optimal 
cost  (3). 

As described in part I, we define the order statistics
$X_{1:N}, \ldots, X_{N:N}$, where the notation is useful to keep track
of the number of agents $N$. Now denote $D_1 = X_{1:N}$, $D_i =
X_{i:N} - X_{i-1:N}$ for $i = 2, \ldots, N$, and $D_{N+1} = b - X_{N:N}$. The
variables $D_i$ are referred to as spacings. We are interested in
the distribution and the expected value of $\min_{i=1,\ldots,N+1}\{D_i\}$.
The distribution of the spacings is a classical result:

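Lemma 3: Let $X_1, \ldots, X_N$ be iid random variables, uniformly
distributed on $[0, b]$. Let $c_1, \ldots, c_{N+1} \ge 0$, such that
$\sum_{i=1}^{N+1} c_i \le b$. Then we have

$$P(D_1 > c_1, \ldots, D_{N+1} > c_{N+1}) = \left(1 - \frac{c_1}{b} - \ldots - \frac{c_{N+1}}{b}\right)^N.$$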

Proof:  For  a  proof  of  this  result  when  b  =  1,  we  refer  for 
example  to  [2].  The  result  of  the  lemma  follows  by  scaling, 
i.e.  dividing  all  the  quantities  by  b  to  obtain  the  base  case. 

■ 

Taking $c_1 = \ldots = c_{N+1} = x$, we get $P(\min_{i=1,\ldots,N+1}\{D_i\} > x)
= P(D_1 > x, \ldots, D_{N+1} > x) = (1 - (N+1)(x/b))^N$ if $0 \le
x \le \frac{b}{N+1}$. So the expected value of the minimum interval
length is

$$
E[\min_i\{D_i\}] = \int_0^{+\infty} P(\min_i\{D_i\} > x)\,dx
= \int_0^{\frac{b}{N+1}} \left(1 - (N+1)\frac{x}{b}\right)^N dx
= \frac{b}{(N+1)^2},
$$

and the expected optimal cost function is

$$Z_r(N) = b\left(1 + \frac{1}{(N+1)^2}\right). \qquad (5)$$

It is clear now that we have a “saturation” effect when we
use more agents, since the cost function goes asymptotically
towards $b$. If we add a penalty for using more agents, for
example a linear term $c_a N$ added to $Z_r(N)$, we can solve
for the optimal number of agents for the task by setting the
derivative of $Z_r(N) + c_a N$ with respect to $N$ to zero:

$$N^* = \left(\frac{2b}{c_a}\right)^{1/3} - 1, \qquad (6)$$

where $c_a$ represents the cost per agent. This solution has to
be adapted slightly in order for $N^*$ to be an integer.

It  is  interesting  to  note  that  N*  grows  relatively  slowly 
with  b.  For  example,  the  intuition  that  in  order  to  explore  a 
line  of  length  2b  we  need  twice  the  number  of  agents  used 
to  explore  a  line  of  length  b  leads  to  a  large  overestimate. 
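The following Python sketch illustrates the two quantities discussed in this part: the expected minimum spacing for uniformly dropped agents, checked by simulation, and the integer number of agents minimizing the penalized cost $Z_r(N) + c_a N$. The function names, the number of trials, the seed and the numerical values of `b` and `c_a` are assumptions made for illustration only.

```python
import random


def expected_min_spacing(n, b):
    """Closed form b / (N + 1)^2 for N uniform agents on [0, b]."""
    return b / (n + 1) ** 2


def simulate_min_spacing(n, b, trials=20000, seed=0):
    """Monte Carlo estimate of the expected smallest spacing."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        pts = [0.0] + sorted(rng.uniform(0, b) for _ in range(n)) + [b]
        total += min(pts[i + 1] - pts[i] for i in range(n + 1))
    return total / trials


def best_agent_count(b, c_a, n_max=1000):
    """Integer N >= 1 minimizing b * (1 + 1/(N+1)^2) + c_a * N."""
    def cost(n):
        return b * (1 + 1 / (n + 1) ** 2) + c_a * n
    return min(range(1, n_max + 1), key=cost)


# Hypothetical values: doubling the line length from 100 to 200
# changes the best integer agent count only from 5 to 6.
n_short = best_agent_count(100.0, 1.0)
n_long = best_agent_count(200.0, 1.0)
```

Searching over integers directly avoids the rounding step needed when starting from the continuous minimizer.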


IV.  Agents  with  Probabilistic  Behavior: 
Continuous  Model 

A.  Feedback  Strategy 

We now turn to a situation where we have agents with
a probabilistic behavior as described in part I. First, we
consider the exploration problem with a single agent in the
continuous model. Suppose the agent initially at $x_0$ moves
along the continuous version of the line $[0, b]$ with speed
$u(t)v + \sigma W_t$, where $v$ is a positive constant, $u(t) \in [-1, +1]$
is the control, the standard deviation $\sigma$ is a positive constant
and $W_t$ is white Gaussian noise with unit power spectral
density. The position of the agent follows the
stochastic differential equation:

$$dX_t = u(t)\,v\,dt + \sigma\,dB_t \qquad (7)$$

where $B_t$ is one-dimensional Brownian motion. Since $v$ is
constant, the individual cost function can equivalently be the
time spent moving or the distance travelled.

Several  strategies  can  be  employed  to  explore  the  line 
with  one  such  agent.  We  first  review  the  optimal  control 
result,  i.e.  a  strategy  minimizing  the  expected  exploration 
time,  that  can  be  implemented  on  an  agent  with  positioning 
or  communication  capacities. 

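Proposition 4: Denote by $X(t)$ the position of the agent on the
line. The optimal feedback law $u(X(t))$ to bring the agent in
minimum time to 0 or $b$ is:

$$
u = -1 \ \text{ if } X(t) \in [0, b/2], \qquad u = +1 \ \text{ if } X(t) \in (b/2, b].
$$

The minimum expected time to hit the boundary at 0 or $b$
is:

$$
\begin{cases}
\dfrac{x_0}{v} - \dfrac{\sigma^2}{2v^2}\, e^{-vb/\sigma^2} \left[ e^{2v x_0/\sigma^2} - 1 \right] & \text{if } x_0 \le b/2, \\[2ex]
\dfrac{b - x_0}{v} - \dfrac{\sigma^2}{2v^2}\, e^{-vb/\sigma^2} \left[ e^{2v(b - x_0)/\sigma^2} - 1 \right] & \text{if } x_0 > b/2.
\end{cases}
\qquad (8)
$$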


Proof: Let $f(x, t) = \min_u E_x\{\tau - t\}$, where $\tau$ denotes
the first time the agent hits the boundary at 0 or $b$, and $x$
is the initial position of the agent ($E_x$ denotes the expected
value given $X(0) = x$). The dynamic programming equation
for $f$ is [3]:

$$
-\frac{\partial f}{\partial t}(x, t) = \min_{u \in [-1, +1]} \left[ 1 + u v \frac{\partial f}{\partial x}(x, t) + \frac{1}{2} \sigma^2 \frac{\partial^2 f}{\partial x^2}(x, t) \right].
$$

It is readily seen that the minimization occurs for $u =
-\mathrm{sign}\left(\frac{\partial f}{\partial x}(x, t)\right)$, and the twice continuously differentiable
solution of the corresponding equation with boundary conditions
$f(0, t) = f(b, t) = 0$ is given in the proposition.  ■

Proposition 4 tells us how to optimally hit the first boundary.
It is intuitively clear that once the first boundary has been
hit, the control should remain constant, telling the agent to
travel as fast as possible to the other boundary. The process
describing the agent position then reduces to a Brownian
motion with drift between a reflecting barrier (the boundary
already hit) and an absorbing barrier (the second boundary
to reach). Here we add a lemma based on the calculations
in [4] for this process, which is useful to obtain closed-form
results in the following.
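As a sanity check on the first-hitting-time expression (8), the sketch below evaluates it and verifies numerically, by central finite differences, that it satisfies the stationary dynamic programming equation $1 + u v f' + \frac{1}{2}\sigma^2 f'' = 0$ under the feedback law of Proposition 4. The function names, the step size `h` and the parameter values are assumptions chosen for illustration.

```python
import math


def min_hit_time(x0, b, v, sigma):
    """Expected time to first hit 0 or b under the bang-bang feedback law,
    written with d = distance from x0 to the nearest endpoint."""
    d = min(x0, b - x0)
    k = sigma ** 2 / (2 * v ** 2)
    return d / v - k * math.exp(-v * b / sigma ** 2) * (
        math.exp(2 * v * d / sigma ** 2) - 1
    )


def hjb_residual(x, b, v, sigma, h=1e-3):
    """Residual 1 + u*v*f' + (sigma^2 / 2)*f'' with u = -1 on the left half
    of the line and u = +1 on the right half (derivatives by finite differences)."""
    f = lambda y: min_hit_time(y, b, v, sigma)
    f1 = (f(x + h) - f(x - h)) / (2 * h)
    f2 = (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2
    u = -1.0 if x <= b / 2 else 1.0
    return 1 + u * v * f1 + 0.5 * sigma ** 2 * f2


# Hypothetical parameters.
b, v, sigma = 10.0, 1.0, 2.0
residual_left = hjb_residual(2.0, b, v, sigma)
residual_right = hjb_residual(8.0, b, v, sigma)
```

Both residuals should be numerically close to zero, and the value vanishes at the endpoints as required by the boundary conditions.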


Lemma 5: Consider the agent subject to constant control,
moving towards a target $h = 0$ or $b$. If $E(t|x_0)$ is the expected
time for the agent to reach the target, we have:

$$
E(t|x_0) = \frac{|h - x_0|}{v} - \frac{\sigma^2}{2v^2} \left[ e^{-2v(b - |x_0 - h|)/\sigma^2} - e^{-2vb/\sigma^2} \right]. \qquad (9)
$$


Hence, using (9) and Proposition 4 immediately gives the
total optimal expected time $T_{opt}$ necessary to explore the line
with one agent using position feedback. For example in the
case $x_0 < b/2$ we get:

$$
T_{opt} = \frac{b + x_0}{v} - \frac{\sigma^2}{2v^2} \left[ 1 - e^{-2vb/\sigma^2} + e^{-v(b - 2x_0)/\sigma^2} - e^{-vb/\sigma^2} \right]. \qquad (10)
$$

Note that the expected exploration time is smaller than the time
$(b + x_0)/v$ needed in the deterministic case.


B.  Open-Loop Strategy for One Non-Compliant Agent.

The  implementation  of  the  optimal  policy  requires  that  the 
agent  knows  its  exact  position  on  the  line  at  each  instant,  at 
least  with  respect  to  the  point  b/2.  In  the  rest  of  this  section, 
we  discuss  “open-loop”  strategies,  assuming  the  agents  used 
do  not  have  the  capacities  to  receive  feedback  instructions 
during  the  exploration.  A  straightforward  approach  is  to  see 
at  the  beginning  of  the  mission  which  is  the  endpoint  closest 
to  the  initial  position.  Next,  tell  the  agent  to  move  first 
towards  that  endpoint  until  it  reaches  it,  and  then  towards 
the  other  endpoint  until  it  reaches  it. 

With a constant control, we know that an agent will
eventually reach a given target with probability one: in the
discrete case for example, this is given by the ergodicity of
the underlying Markov chain describing the position. Again,
(9) can be used to obtain the expected time $T_{ol}$ necessary to
finish the exploration. In the case $x_0 < b/2$ we get:

$$
T_{ol} = \frac{b + x_0}{v} - \frac{\sigma^2}{2v^2} \left[ 1 + e^{-2v(b - x_0)/\sigma^2} - 2 e^{-2vb/\sigma^2} \right]. \qquad (11)
$$
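The expected exploration times with and without feedback can be compared numerically. The sketch below assumes the expressions (10) and (11) take the form obtained by adding the expected duration of each leg from Proposition 4 and (9), for $x_0 \le b/2$; the function names and parameter values are illustrative assumptions.

```python
import math


def t_feedback(x0, b, v, sigma):
    """Expected exploration time with position feedback, x0 <= b/2:
    optimal first hit of a boundary, then constant control to the other end."""
    k = sigma ** 2 / (2 * v ** 2)
    c = v / sigma ** 2
    return (b + x0) / v - k * (
        1 - math.exp(-2 * c * b)
        + math.exp(-c * (b - 2 * x0)) - math.exp(-c * b)
    )


def t_open_loop(x0, b, v, sigma):
    """Expected exploration time without feedback, x0 <= b/2:
    constant control towards 0, then constant control towards b."""
    k = sigma ** 2 / (2 * v ** 2)
    a = 2 * v / sigma ** 2
    return (b + x0) / v - k * (
        1 + math.exp(-a * (b - x0)) - 2 * math.exp(-a * b)
    )


# Hypothetical parameters: the gap is largest when x0 is close to b/2.
b, v, sigma = 10.0, 1.0, 2.0
gaps = [t_open_loop(x0, b, v, sigma) - t_feedback(x0, b, v, sigma)
        for x0 in (0.0, 1.0, 2.5, 5.0)]
```

For these values the gap stays below $\sigma^2/(2v^2)$ and vanishes when the agent starts at an endpoint, where feedback brings no advantage.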


The difference with $T_{opt}$ is then found to be bounded in every
case as follows:

$$T_{ol} - T_{opt} \le \frac{\sigma^2}{2v^2},$$

which tells us that we are not losing much if we do not
implement any feedback.

Because in practice the agent has limited autonomy, it is
useful to know more about the distribution of the exploration
time. Let us describe the position of the agent by the process
$X_t$ starting at $X_0 = x_0$, and consider the time necessary for
the agent to be absorbed at 0 with a high enough probability,
treating $b$ as a reflecting barrier. A bound on this time is
obtained by considering an auxiliary process $Y_t$ describing
the movement of an agent with the same dynamics but on
an infinite line (i.e. without barriers at 0 and $b$). We have

$$P(X_t = 0 \mid X_0 = x_0) \ge P(Y_t \le 0 \mid Y_0 = x_0). \qquad (12)$$


The reason is simply that before hitting 0, both processes
have the same behavior; but once $X_t$ hits 0 we know for sure
that it will remain there, whereas $Y_t$ might become positive
again after it hits 0 for the first time. From this idea we get
the following proposition.


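Proposition 6: Let $0 < \varepsilon < 1$ and $\alpha$ be given by $\Phi(\alpha) =
\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\alpha} e^{-z^2/2}\,dz = 1 - \varepsilon$. Suppose without loss of generality
that the agent is initially at $x_0 < b/2$ and therefore told to
move to 0 first. The agent has reached 0 with probability at
least $1 - \varepsilon$ for $t \ge t_0$, where $t_0$ is given by

$$
t_0 = \frac{x_0}{v} + \frac{\alpha^2 \sigma^2}{v^2} + \frac{2\sigma\alpha}{v} \left( \frac{x_0}{v} \right)^{1/2}. \qquad (13)
$$

Proof: Write $Y_t = (x_0 - vt) + \sigma\sqrt{t}\,\xi$, where $\xi$ is a
random variable with standard normal distribution. Then
$P\left(\xi \le \frac{vt - x_0}{\sigma\sqrt{t}}\right) \ge 1 - \varepsilon$ is obtained for
$\frac{vt - x_0}{\sigma\sqrt{t}} \ge \alpha$. Solving for equality in this inequality, we obtain
for $t_0$:

$$
\sqrt{t_0} = \frac{\alpha\sigma + \sqrt{\alpha^2\sigma^2 + 4 x_0 v}}{2v}.
$$

For $\alpha \ge 0$, this is simplified to obtain a more conservative
sufficient time as follows:

$$
\sqrt{t_0} \le \frac{\alpha\sigma}{v} + \left( \frac{x_0}{v} \right)^{1/2},
$$

using $\sqrt{x + y} \le \sqrt{x} + \sqrt{y}$ for $x, y \ge 0$. Squaring the right-hand
side gives (13).  ■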

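Remark 7: For $\alpha > 0$, we have $\int_{\alpha}^{\infty} e^{-z^2/2}\,dz \le \frac{1}{\alpha} e^{-\alpha^2/2}$. In
particular, when $\alpha \ge 1$ (that is, for $\varepsilon$ small enough) we obtain a more
conservative bound which is easier to apply, choosing $\alpha$
such that $e^{-\alpha^2/2} \le \sqrt{2\pi}\,\varepsilon$, i.e. $\alpha \ge \sqrt{2 \ln \frac{1}{\sqrt{2\pi}\,\varepsilon}}$.
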

Remark  8:  An  exact  expression  for  the  distribution  of 
the  exploration  time  can  be  seen  as  a  special  case  of  the 
calculations  in  [5].  However,  our  bound  above  is  easier  to 
interpret  and  sufficient  for  our  purpose. 
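The time threshold of Proposition 6 can be evaluated with the standard library. The sketch below computes the exact solution of the equality in the proof and, as an assumption, the simplified sufficient time in the form $(\sqrt{x_0/v} + \alpha\sigma/v)^2$ obtained from $\sqrt{x+y} \le \sqrt{x} + \sqrt{y}$; the function names and parameter values are illustrative.

```python
import math
from statistics import NormalDist


def exact_t0(x0, v, sigma, eps):
    """Smallest t with P(Y_t <= 0) >= 1 - eps for the free process
    Y_t = x0 - v*t + sigma*sqrt(t)*xi, xi standard normal."""
    alpha = NormalDist().inv_cdf(1 - eps)
    root = (alpha * sigma + math.sqrt(alpha ** 2 * sigma ** 2 + 4 * x0 * v)) / (2 * v)
    return root ** 2


def simple_t0(x0, v, sigma, eps):
    """Simplified, more conservative threshold (assumes eps <= 1/2, i.e. alpha >= 0)."""
    alpha = NormalDist().inv_cdf(1 - eps)
    if alpha < 0:
        raise ValueError("the simplified bound assumes eps <= 1/2")
    return (math.sqrt(x0 / v) + alpha * sigma / v) ** 2


def prob_below_zero(t, x0, v, sigma):
    """P(Y_t <= 0) for the free process at time t > 0."""
    return NormalDist().cdf((v * t - x0) / (sigma * math.sqrt(t)))


# Hypothetical parameters.
x0, v, sigma, eps = 30.0, 1.0, 2.0, 0.05
t_exact = exact_t0(x0, v, sigma, eps)
t_simple = simple_t0(x0, v, sigma, eps)
```

By (12), the probability computed for the free process is a lower bound on the probability that the actual agent has already been absorbed at 0.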


From the proposition, we can immediately conclude that
for $0 < \varepsilon < 1/2$ and the agent going to 0 and then to $b$,
the line will be completely explored with probability at least
$1 - 2\varepsilon$ for $t \ge t_1$, where:

$$
t_1 = \frac{b + x_0}{v} + \frac{2\alpha^2\sigma^2}{v^2} + \frac{2\sigma\alpha}{v} \left[ \left( \frac{b}{v} \right)^{1/2} + \left( \frac{x_0}{v} \right)^{1/2} \right].
$$

Note that this simple open-loop strategy is asymptotically
optimal in the limit where $b \to \infty$, since $(b + x_0)/v$ is the time
for a deterministic agent to explore the line. Also, in the limit
when $\alpha = 0$, i.e. $\varepsilon = 1/2$, we obtain the same speed as in
the deterministic case, provided we can be satisfied with a
very low probability of success. In fact the bound is not tight
due to the crude use of Boole's inequality to obtain $1 - 2\varepsilon$
for the probability of success in the two successive travel
periods. If we look only at the travel from $x_0$ to 0 in the
proposition, we see that in fact we can be faster than in the
deterministic case by allowing $\varepsilon$ to be greater than 1/2, i.e.
$\alpha < 0$. This appears natural as we can exploit the possibility
that the speed might take values well above its mean.

C.  Agents with Limited Autonomy.

Suppose we have an infinite number of non-compliant
agents that we can use to explore the interval $[0, b]$, all
starting from $b$ at the beginning of the mission. The mission
terminates when an agent reaches 0. These agents work under
the open-loop policy described in the previous paragraph,
since it was argued that in general, adding position feedback
does not dramatically increase the performance. We illustrate
in this part applications of the previous results for two
multi-agent exploration scenarios.

If every agent can only run for a time $t_0$, the line
segment exploration problem can be seen as a Monte-Carlo
algorithm [6]; that is, the algorithm might sometimes produce
an incorrect answer but we are able to bound the probability
of that incorrect answer using (13). The running time of this
“algorithm” is guaranteed to be $t_0(\varepsilon)$ (where the notation
shows the dependence on $\varepsilon$ explicitly) and the probability
of the result being correct is at least $1 - \varepsilon$. To improve the
probability of success of a Monte-Carlo algorithm, we simply
run it repeatedly, trading off running time. This means for
our task that we can send multiple agents successively, and
let each of them run for $t_0$, until one of them finishes the
task. The expected time it takes to finish the exploration is
then upper bounded by $t_0(\varepsilon)/(1 - \varepsilon)$, since the number of
agents sent until the first success follows a geometric
distribution with parameter $1 - \varepsilon$.

Consider the following scenario. We assume that we collect
an expected reward which is a function of the expected
time to finish the exploration. Hence, let us consider an
expected reward of the form $R e^{-\gamma_1 t_0(\varepsilon)/(1 - \varepsilon)}$ (more precisely,
this should be a lower bound on the expected reward that
we can achieve). There is also a linear cost $\gamma_2 t_0$ associated
with the use of agents with a greater autonomy, which are more
expensive. The total expected reward is then

$$
Z(\varepsilon) = R e^{-\gamma_1 t_0(\varepsilon)/(1 - \varepsilon)} - \gamma_2 t_0(\varepsilon), \qquad R, \gamma_1, \gamma_2 > 0. \qquad (14)
$$

Fig. 4. Expected exploration reward. Parameters: $v = 1$, $\sigma = 2$, $b = 100$, $R =
1000$, $\gamma_1 = 10^{-2}$, $\gamma_2 = 1$. The optimal reward $Z_{opt} = 94$ is obtained for $\varepsilon =
0.21$.


Since  we  can  use  multiple  agents  to  finish  the  task,  we 
will  not  necessarily  need  to  require  a  high  probability  for 
one  agent  to  finish  correctly.  There  is  a  trade-off  between  the 
development  of  better  agents  with  a  greater  autonomy  and 
the  reward  that  we  can  collect  from  them.  Moreover,  with 
multiple  agents  we  should  be  able  to  use  the  cases  where  the 
random  component  of  the  speed  allows  a  faster  execution. 
Therefore, the optimal $\varepsilon$ increases with the noise level $\sigma$.

Various  shapes  can  be  obtained  for  the  function  Z(e)  for 
different  choices  of  parameters.  Fig.  4  is  an  illustration  for 
specific  values. 

Once we have computed the optimal $\varepsilon$, we might be
interested in knowing how many agents will actually be necessary
to perform the exploration in practice. As mentioned
earlier, the expected number of agents used until one of them
finishes is $1/(1 - \varepsilon)$. Standard Chernoff arguments apply
to the corresponding geometric sequence to show that the
number of agents used will be close to this expectation with
high probability. Looking back at the example illustrated in
Fig. 4, we obtain an expected number of agents of about
1.26 at the optimal $\varepsilon$. Obviously a number of different scenarios can be
studied  in  a  similar  way,  but  this  tells  us  again  that  the  cost 
of  using  multiple  agents  should  be  included  in  reasonable 
models,  because  the  saturation  effect  already  encountered  in 
the  previous  part  can  be  dominant. 
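The repeated-trial argument can be illustrated with a short Python sketch. It treats each agent as an independent trial succeeding with probability exactly $1 - \varepsilon$, which is an assumption (the text only guarantees at least $1 - \varepsilon$); the function names are illustrative.

```python
def expected_agents(eps):
    """Mean of a geometric number of trials with success probability 1 - eps."""
    return 1.0 / (1.0 - eps)


def prob_more_than(k, eps):
    """Probability that the first k agents all fail, i.e. more than k are needed."""
    return eps ** k


def expected_time_bound(t0, eps):
    """Upper bound t0 / (1 - eps) on the expected time when each agent runs for t0."""
    return t0 * expected_agents(eps)


# Value of eps taken from the example of Fig. 4.
n_mean = expected_agents(0.21)
p_tail = prob_more_than(3, 0.21)
```

With $\varepsilon = 0.21$ the chance of needing more than three agents is below one percent under this assumption, which is consistent with the saturation effect discussed above.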


V.  Agents  with  Probabilistic  Behavior:  Discrete 
Model 

In this part we extend some results of the previous section
to the discrete model. These results would translate directly
to the discrete case, except that using the central limit
theorem to determine the bound on the expected exploration
time would only give us a result in the limit $b \to \infty$. However,
a concentration inequality allows us to obtain finite-time
bounds.

A.  Optimal  Closed-Loop  Policy  for  a  Single  Agent 

The  behavior  of  a  non-compliant  agent  in  the  discrete  case 
was  described  in  part  I.  Remember  that  now  we  want  to 
explore  a  line  graph  with  b+  1  vertices.  An  agent  moves 
on  the  line  following  a  controlled  random  walk:  at  a  given 
period,  it  goes  in  the  required  direction  with  probability  p, 
stays  where  it  is  with  probability  1  —  p  —  q  and  goes  in  the 
opposite  direction  with  probability  q.  The  characteristics  of 
the  boundaries  are  as  follows:  if  the  agent  is  at  0  and  told 
to  go  left,  it  will  remain  where  it  is  with  probability  1  —  q 
and  go  right  with  probability  q.  If  told  to  go  right,  it  will  go 
right  with  probability  p  and  stay  at  0  with  probability  1  —  p 
(assume $p > q$). The boundary at $b$ is described symmetrically.
Suppose  that  a  single  agent  is  initially  at  site  xo  on  the 
discrete  line,  and  that  we  want  to  explore  the  line  while 
minimizing  the  expected  exploration  time.  To  determine  the 
optimal  policy  minimizing  the  expected  cover  time  for  the 
corresponding controlled Markov chain, we can use a standard
dynamic programming approach. This is summarized
in  the  following  proposition,  which  parallels  the  continuous 
case.  It  can  be  proved  using  the  value  iteration  method  as 
described  in  [7],  for  a  stochastic  shortest  path  problem  on 
a finite number of states. Under these conditions, Bellman’s
equation  holds.  Since  the  result  is  now  intuitively  clear  and 
the  proof  is  straightforward  but  lengthy,  we  omit  it. 

Proposition  9:  The  optimal  policy  to  explore  the  discrete 
line  in  minimum  expected  time  with  a  non-compliant  agent 
is  to  always  send  the  agent  towards  the  nearest  still  unvisited 
endpoint. 

Since  we  know  the  optimal  policy,  we  can  compute  the 
corresponding  optimal  expected  cost  (at  least  numerically)  as 
a  solution  of  the  linear  system  corresponding  to  Bellman’s 
equation,  where  we  know  the  result  of  the  minimization  for 
each  state.  Solving  the  linear  system  analytically  is  difficult 
compared  to  the  continuous  case  calculation,  and  instead  we 
simply provide a lower bound result analogous to Lemma 5.

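Lemma 10: Let $E(\tau|x_0)$ be the optimal expected travel
cost for a single non-compliant agent initially at site $x_0$ and
moving under constant control towards 0. We have

$$
E(\tau|x_0) \ge \frac{x_0}{1 - 2q} - \frac{q^{b+1}}{(1 - q)^b (1 - 2q)^2} \left[ \left( \frac{1 - q}{q} \right)^{x_0} - 1 \right]. \qquad (15)
$$
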

Proof: This lower bound is obtained as follows: the
dynamics of the agent follow a random walk between an
absorbing barrier (the endpoint to reach) and a reflecting
barrier (the endpoint already visited). The expected time is
obtained as a solution of the corresponding subsystem in
Bellman's equation. For the case $q = 1 - p$, this system was
solved in [8], [9]. In our case however, we can have $p + q < 1$.
But if we consider the random walk with parameters $p_1, q_1$
such that $q_1 = q$ and $p_1 = 1 - q$ (i.e. when the agent would
remain idle in the original process, in the modified process it
moves in the intended direction), we obviously reach the target in
a shorter time. The lower bound given in the lemma can
therefore be obtained from [8], for our case where $q < 1/2$.

■ 
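
The comparison used in this proof can be checked numerically. The following is a minimal sketch, not the authors' computation: it assumes that a rightward step at the reflecting barrier $b$ is blocked (the agent stays at $b$), and the function names are ours. It computes the expected hitting time of 0 by a backward recursion on the increments $T(x) - T(x-1)$, and compares it with the classical closed form for the faster walk with $p_1 = 1 - q$.

```python
def expected_hitting_time(x0, b, p, q):
    """Expected number of steps to reach site 0 from x0.

    Walk on {0, ..., b}: step -1 w.p. p, +1 w.p. q, idle w.p. 1 - p - q.
    Site 0 is absorbing; at site b a step to the right is blocked
    (assumed convention: the agent stays at b).
    """
    # Let D[x] = T(x) - T(x - 1). The barrier at b gives p * D[b] = 1,
    # and interior sites give p * D[x] = 1 + q * D[x + 1].
    d = 1.0 / p
    increments = [d]              # D[b]
    for _ in range(b - 1):
        d = (1.0 + q * d) / p
        increments.append(d)      # D[b-1], ..., D[1]
    increments.reverse()          # now increments[x - 1] == D[x]
    return sum(increments[:x0])   # T(x0) = D[1] + ... + D[x0]


def closed_form_fast_walk(x0, b, q):
    """Closed-form hitting time for the modified walk p1 = 1 - q, q < 1/2."""
    p = 1.0 - q
    coeff = q ** (b + 1) / (p ** b * (p - q) ** 2)
    return x0 / (p - q) - coeff * ((p / q) ** x0 - 1.0)


if __name__ == "__main__":
    b, q = 10, 0.2
    for p in (0.8, 0.6, 0.5):     # p = 0.8 is the case p + q = 1
        print(p, expected_hitting_time(b, b, p, q), closed_form_fast_walk(b, b, q))
```

For $p = 1 - q$ the two values coincide, and for $p + q < 1$ the recursion gives a larger value, which is the sense in which the modified walk provides a lower bound.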

This  result  can  be  used  as  before  to  argue  that  adding 
position  feedback  does  not  add  a  lot  to  the  performance  of 
the  agent.  This  is  because  even  in  the  optimal  case,  the  agent 
will  have  to  travel  from  the  first  hit  boundary  to  the  second 
one, and in this phase there is no difference between open-loop
and closed-loop strategies. Using (15), we know then that
the optimal policy will have a cost of at least $b/(1-2q)$.
During  the  first  phase,  we  can  expect  from  the  continuous 
model  result  that  the  feedback  performance  is  also  relatively 
close  to  the  open-loop  performance.  We  do  not  make  the 
argument  more  formal  here. 

B.  Open-Loop  Policy 

As  in  the  continuous  case,  we  consider  simple  open-loop 
policies  that  are  in  practice  a  lot  easier  to  implement  and 
should  perform  relatively  well  with  respect  to  the  optimum. 
So  for  an  agent  with  limited  autonomy,  we  tell  the  agent 
to  go  towards  the  closest  endpoint  for  a  fixed  maximum 
number  of  steps,  and  then  to  switch  direction  and  go  towards 
the  other  endpoint  again  for  a  fixed  number  of  steps.  If  the 
agent  has  infinite  autonomy,  it  goes  in  each  direction  until 
it  reaches  its  target,  which  happens  with  probability  one. 
The implementation of the policy only involves mission
pre-planning and no online re-planning.

We  can  derive  a  result  analogous  to  the  bound  on  the 
exploration  time  (13)  in  the  continuous  model.  Consider 
an agent travelling under constant control from its initial
position $x_0$ towards 0. If $X_n$ represents the position at time $n$
of the agent moving between the two barriers (absorbing
at 0, reflecting at $b$), and $Y_n$ is the position of an agent
starting from $x_0$ and moving on an infinite discrete line
which is an extension of our interval, with the same transition
probabilities as $X_n$ (but without barriers), we have:

$$P(X_n = 0 \mid X_0 = x_0) \ge P(Y_n \le 0 \mid Y_0 = x_0), \quad \forall n$$

Define $Z_1, Z_2, \ldots$ iid random variables with $P(Z_i = -1) = p$,
$P(Z_i = 0) = 1 - p - q$, $P(Z_i = 1) = q$. Then we have:

$$Y_0 = x_0, \qquad Y_n = Y_0 + \sum_{i=1}^{n} Z_i, \quad n \ge 1$$

Let $\mu$ and $\sigma^2$ be the mean and the variance of $Z_i$:

$$\mu = -p + q, \qquad \sigma^2 = p + q + 2pq - p^2 - q^2$$
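
As a sanity check on these two moments, the sketch below (our own illustration, with hypothetical function names) builds the exact law of the sum of $n$ steps by repeated convolution and compares its mean and variance with $n$ times the single-step mean $q - p$ and variance $p + q - (q - p)^2$. The last expression is the same quantity as $p + q + 2pq - p^2 - q^2$.

```python
def step_moments(p, q):
    """Mean and variance of one step Z in {-1, 0, +1}."""
    mu = q - p                       # E[Z] = -p + q
    sigma2 = p + q - (q - p) ** 2    # E[Z^2] - E[Z]^2
    return mu, sigma2


def sum_distribution(n, p, q):
    """Exact law of Z_1 + ... + Z_n as a dict {value: probability}."""
    law = {0: 1.0}
    steps = ((-1, p), (0, 1.0 - p - q), (1, q))
    for _ in range(n):
        new_law = {}
        for value, mass in law.items():
            for step, prob in steps:
                new_law[value + step] = new_law.get(value + step, 0.0) + mass * prob
        law = new_law
    return law


def prob_unconstrained_walk_at_or_below_zero(x0, n, p, q):
    """P(Y_n <= 0 | Y_0 = x0) for the walk without barriers."""
    law = sum_distribution(n, p, q)
    return sum(mass for value, mass in law.items() if x0 + value <= 0)


if __name__ == "__main__":
    p, q, n = 0.6, 0.2, 20
    mu, sigma2 = step_moments(p, q)
    law = sum_distribution(n, p, q)
    mean = sum(k * m for k, m in law.items())
    var = sum(k * k * m for k, m in law.items()) - mean ** 2
    print(mean, n * mu, var, n * sigma2)
```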


We assume $p > q$ and therefore $\mu < 0$. Notice that $v = |\mu|$.
With these notations, we have:


Proposition 11: Let $0 < \epsilon < 1$, and $a = \sqrt{2 \ln(1/\epsilon)}$. Assume
without loss of generality that the non-compliant agent starts
at $0 < x_0 \le \lfloor b/2 \rfloor$. Then the agent moving under constant
control towards 0 has reached 0 with probability at least
$(1 - \epsilon)$ for $n \ge n_0$, with

$$n_0 = \frac{x_0}{v} + \frac{a\sigma}{v}\left(\frac{x_0}{v}\right)^{1/2} + \frac{a^2\sigma^2}{v^2} + \frac{a^2}{3v} \qquad (16)$$



If $x_0 = 0$, obviously we take $n_0 = 0$ since it means that
we start at the absorbing barrier. Note the similarity to the
expression obtained in the continuous case, in particular
when we use the expression for $a$ mentioned in remark 7. In
the limit where $x_0$ is large, we have $n_0 = \frac{x_0}{v}(1 + o(1))$. If we
interpret $v = |\mu|$ as the mean speed of the agent, this result
says that asymptotically for $x_0$ and $b$ large we do not have to
wait a lot more in the stochastic case than in a deterministic
situation where we have an agent moving at speed $v$.
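
To give a feel for how conservative this bound is, the sketch below evaluates the number of steps and compares it with the exact absorption probability obtained by propagating the distribution of the agent's position. It is an illustration under stated assumptions, not part of the original analysis: we read $a$ as $\sqrt{2\ln(1/\epsilon)}$ and (16) as $x_0/v + (a\sigma/v)\sqrt{x_0/v} + a^2\sigma^2/v^2 + a^2/(3v)$, we assume a blocked step at the reflecting barrier leaves the agent in place, and the function names are ours.

```python
import math


def n0_bound(x0, p, q, eps):
    """Number of steps from (16), under our reading of the formula."""
    v = p - q                                   # mean speed |mu|, needs p > q
    sigma = math.sqrt(p + q - (q - p) ** 2)     # std deviation of one step
    a = math.sqrt(2.0 * math.log(1.0 / eps))    # assumed definition of a
    return (x0 / v
            + (a * sigma / v) * math.sqrt(x0 / v)
            + (a * sigma / v) ** 2
            + a * a / (3.0 * v))


def prob_reached_zero(x0, b, p, q, n):
    """Exact P(X_n = 0 | X_0 = x0): absorbing at 0, reflecting at b."""
    dist = [0.0] * (b + 1)
    dist[x0] = 1.0
    for _ in range(n):
        new = [0.0] * (b + 1)
        new[0] = dist[0]                        # absorbed mass stays at 0
        for x in range(1, b + 1):
            mass = dist[x]
            new[x - 1] += p * mass              # step towards 0
            if x < b:
                new[x + 1] += q * mass          # step away from 0
                new[x] += (1.0 - p - q) * mass  # idle
            else:
                new[x] += (1.0 - p) * mass      # blocked step: stay at b
        dist = new
    return dist[0]


def steps_and_probability(x0, b, p, q, eps):
    """Round n0 up to an integer and return it with the exact probability."""
    n = math.ceil(n0_bound(x0, p, q, eps))
    return n, prob_reached_zero(x0, b, p, q, n)


if __name__ == "__main__":
    n, prob = steps_and_probability(x0=5, b=10, p=0.6, q=0.2, eps=0.05)
    print(n, prob, 5 / 0.4)   # steps from (16), exact probability, x0 / v
```

In this example the deterministic travel time $x_0/v$ is 12.5 steps, while the bound asks for several times that; the gap shrinks in relative terms as $x_0$ grows, in line with the $(1 + o(1))$ statement.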

Proof: Let $S_n = \sum_{i=1}^{n} Z_i$. We will use Bernstein's
inequality for our distribution on $Z_i$ (see for example [10]
for a survey of concentration inequalities):

$$\forall \delta > 0, \quad P(S_n - \mu n \ge \delta n) \le \exp\left(-\frac{n\delta^2}{2\sigma^2 + 2\delta/3}\right) \qquad (17)$$



Since $\mu < 0$, we can choose $n_0$ integer such that $n_0 \mu < -x_0$.
Then let $\delta = -\frac{x_0}{n_0} - \mu$, we have $\delta > 0$.

Now let $n$ be an integer, $n \ge n_0$. Then we have $[-x_0, +\infty) \subset
[-x_0 \frac{n}{n_0}, +\infty)$, therefore $P(S_n \ge -x_0) \le P(S_n \ge -x_0 \frac{n}{n_0})$.
Moreover, using (17) and our definition of $\delta$, we have


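$$P\left(S_n \ge -x_0\frac{n}{n_0}\right) = P\left(S_n - n\mu \ge \left(-\frac{x_0}{n_0} - \mu\right) n\right)$$

$$P\left(S_n \ge -x_0\frac{n}{n_0}\right) \le \exp\left(-\frac{n\left(\frac{x_0}{n_0} + \mu\right)^2}{2\sigma^2 - \frac{2}{3}\left(\frac{x_0}{n_0} + \mu\right)}\right)$$

Note that with our constraint on $\delta$, $2\sigma^2 - \frac{2}{3}\left(\frac{x_0}{n_0} + \mu\right) > 0$.
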

Since $n \ge n_0$ and $P(Y_n > 0 \mid Y_0 = x_0) = P(S_n > -x_0)$, we obtain
finally

$$P(X_n \ne 0 \mid X_0 = x_0) \le \exp\left(-\frac{(x_0 + n_0\mu)^2}{2 n_0 \sigma^2 - \frac{2}{3}(x_0 + n_0\mu)}\right) \qquad (18)$$


To obtain $P(X_n \ne 0 \mid X_0 = x_0) \le \epsilon$ for $\epsilon > 0$, it is sufficient
to have

$$\frac{(x_0 + n_0\mu)^2}{2 n_0 \sigma^2 - \frac{2}{3}(x_0 + n_0\mu)} \ge \ln\frac{1}{\epsilon}$$

We obtain the value for $n_0$ given in the proposition by
considering only the solution greater than $x_0/|\mu|$. The final
expression is simplified as in the proof of proposition 6.
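
The algebra behind this last step is not written out. One way to carry it out is sketched below, under two assumptions of ours: that $a^2 = 2\ln(1/\epsilon)$, and that the simplification relies on $\sqrt{c^2 + 4d} \le c + 2\sqrt{d}$ for $c, d \ge 0$. The auxiliary symbols $u$, $c$ and $d$ are introduced only for this sketch.

```latex
% Auxiliary notation (ours): v = |\mu|, u = n_0 v - x_0 \ge 0, so that
% x_0 + n_0 \mu = -u and n_0 = (x_0 + u)/v.
\begin{align*}
\frac{u^2}{2 n_0 \sigma^2 + \tfrac{2}{3} u} \ge \frac{a^2}{2}
  &\iff u^2 - c\,u - d \ge 0,
  \qquad c = \frac{a^2 \sigma^2}{v} + \frac{a^2}{3},
  \quad d = \frac{a^2 \sigma^2 x_0}{v}, \\
  &\iff u \ge \frac{c + \sqrt{c^2 + 4d}}{2}
  \qquad \text{(the root with } n_0 \ge x_0/|\mu| \text{)}. \\
\intertext{Since $\sqrt{c^2 + 4d} \le c + 2\sqrt{d}$, it is sufficient to take}
u &= c + \sqrt{d}
   = \frac{a^2 \sigma^2}{v} + \frac{a^2}{3} + a \sigma \sqrt{\frac{x_0}{v}}, \\
n_0 &= \frac{x_0 + u}{v}
   = \frac{x_0}{v} + \frac{a \sigma}{v} \sqrt{\frac{x_0}{v}}
     + \frac{a^2 \sigma^2}{v^2} + \frac{a^2}{3v}.
\end{align*}
```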


VI.  Conclusions 

Two  simple  multi-agent  line  exploration  problems  were 
considered  in  this  paper.  The  optimal  policy  for  exploring 
the  line  with  N  agents  seeking  to  minimize  the  sum  of 
their  travelled  distances  was  obtained.  For  agents  dropped 
randomly  on  the  line,  it  was  shown  that  adding  a  cost 
proportional  to  the  number  of  agents  leads  to  an  optimal 
number  of  agents  to  use  for  the  task.  In  a  second  part, 
we  considered  an  exploration  problem  using  non-compliant 
agents  with  limited  autonomy.  Again  it  was  argued  that  the 
number  of  agents  used  to  perform  a  given  task  should  be 
considered  as  an  important  question.  In  practice  using  more 
agents  has  an  associated  cost  and  might  not  always  lead  to 
a  dramatic  increase  in  the  final  performance. 
Since proposition 11 is almost identical to proposition 6,
it  follows  that  our  discussion  on  multi-agent  exploration  in 
the  continuous  model  is  valid  for  the  discrete  model  as  well. 


